Abstract

As a performance measure for a prediction model, the area under the receiver operating characteristic curve (AUC) is
insensitive to the addition of strong markers. A number of measures sensitive to performance change have recently been
proposed; however, these relative-performance measures may lead to self-contradictory conclusions. This paper examines
alternative performance measures for prediction models: the Lorenz curve-based Gini and Pietra indices, and a standardized
version of the Brier score, the scaled Brier. Computer simulations are performed to study the sensitivity of these
measures to performance change when a new marker is added to a baseline model. When the discrimination power of the
added marker is concentrated in the gray zone of the baseline model, the AUC and the Gini show minimal performance
improvements. In the same situation, the Pietra and the scaled Brier show considerably larger improvements.
The Pietra and the scaled Brier indices are therefore recommended for prediction model performance measurement, in light
of their ease of interpretation, clinical relevance and sensitivity to gray-zone resolving markers.
Introduction

Risk prediction models are important for patients and
physicians alike. A prediction model can be used to integrate an
individual's socio-demographic variables, medical history and
biomarker values, and to translate them into a disease risk,
upon which prognostication and/or treatment decisions can be
based. Examples are the prediction models for cardiovascular
diseases [1], hypertension [2], diabetes [3] and different forms of
cancer [4-6]. Prediction model performance must be evaluated in
a scientific way. There are two aspects to model performance:
calibration and discrimination. Calibration is a measure of how
well the predicted probability agrees with the actual observed risk,
while discrimination is a measure of how well a model separates
those who do and do not have the disease of interest [7]. This study
focuses on evaluating the discrimination ability of a prediction
model.

The area under the receiver operating characteristic (ROC)
curve (AUC) (also referred to as the c statistic) is by far the most
popular index of discrimination ability [8]. AUC is defined as the
probability that the predicted probability of a randomly selected
diseased subject will exceed that of a randomly selected
non-diseased subject. AUC is a value between 0.5 and 1.0, with a
higher value indicating better prediction performance. A
prediction model with an AUC value of 0.5 is no better than tossing a
coin, and at the other extreme, a model with a 1.0 AUC value is a
perfect model, with 100% accurate predictions. However, AUC
has been criticized as insensitive to the addition of strong
marker(s), typically resulting in only small changes in value
[9,10]. A small change in AUC (ΔAUC), even though it is
statistically significant, can be difficult to interpret. For example,
the addition of C-reactive protein to a set of standard risk factors
predicting cardiovascular disease only increases the model AUC
from 0.72 to 0.74 [11], and the ΔAUC is a mere 0.001 (from
0.900 to 0.901) when a genotype score (derived from a total of 18
alleles) is added into the prediction model for type 2 diabetes [3].
One cannot help wondering whether this is because the C-reactive
protein and the genotype score (despite their strong associations
with the disease) are actually useless in disease prediction, or
whether the AUC's insensitivity to model performance change is
entirely to blame.

Recently, a number of 'relative-performance' indices that are
sensitive to performance change have been proposed [12]. These
measures specifically compare models with and without new
markers, and include net reclassification improvement (NRI),
continuous NRI (cNRI) and integrated discrimination
improvement (IDI) [13,14]. NRI is defined as the difference between the
proportion of subjects 'moving up' (changing to higher risk
categories in the model with the new marker(s)) and the proportion
of subjects 'moving down' (changing to lower risk categories) for
diseased subjects, minus the corresponding difference in proportions
for non-diseased subjects [13]. cNRI and IDI also hinge on such
up and down movement. In cNRI, any increase (decrease) in
predicted probability constitutes a movement up (down) [14]. In
IDI, the actual amount of increase/decrease in predicted
probability is counted [13]. However, a relative-performance
measure can sometimes lead to self-contradictory conclusions. For
example, a situation may occur in which the prediction
performances of models A, B and C are rated, using a relative
performance index, as A>B and B>C, yet paradoxically, A<C.
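The verbal definitions of cNRI and IDI given above can be made concrete with a short sketch. The function names and the toy data below are illustrative assumptions, not taken from the cited papers; the code simply follows the usual reading of the definitions (net upward movement among diseased subjects minus net upward movement among non-diseased subjects for cNRI, and the change in the difference of mean predicted probabilities for IDI).

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def continuous_nri(p_old, p_new, diseased):
    """cNRI: any increase (decrease) in predicted probability counts as a move up (down)."""
    def net_up(flag):
        idx = [i for i, d in enumerate(diseased) if d == flag]
        up = sum(p_new[i] > p_old[i] for i in idx)
        down = sum(p_new[i] < p_old[i] for i in idx)
        return (up - down) / len(idx)

    # Net upward movement in the diseased minus that in the non-diseased.
    return net_up(1) - net_up(0)


def idi(p_old, p_new, diseased):
    """IDI: change in (mean p among diseased - mean p among non-diseased)."""
    def slope(p):
        cases = [pi for pi, d in zip(p, diseased) if d == 1]
        controls = [pi for pi, d in zip(p, diseased) if d == 0]
        return mean(cases) - mean(controls)

    return slope(p_new) - slope(p_old)


# Hypothetical toy data: two diseased and two non-diseased subjects.
diseased = [1, 1, 0, 0]
p_old = [0.6, 0.4, 0.5, 0.2]   # predicted probabilities without the new marker
p_new = [0.8, 0.5, 0.3, 0.3]   # predicted probabilities with the new marker

print("cNRI:", continuous_nri(p_old, p_new, diseased))
print("IDI :", idi(p_old, p_new, diseased))
```

Both quantities are defined only for a pair of models, which is what makes them relative measures: neither function can be evaluated for a single model on its own.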

This paper describes and compares a number of alternative 
performance measures for prediction models. These include 
the Lorenz curve-based Gini and Pietra indices [15] and a 
standardized version of the Brier score, the scaled Brier (sBrier) [7].
All these are absolute measures, directly reflecting the prediction 
performance of a specific model, and when used for model 
comparisons they do not produce self-contradictory results. The 
sensitivity of these measures to performance change when new 
marker(s) are added to a baseline model will also be examined. 

Methods 

Formulas for Various Performance Measures 

Assume that there are a total of $n$ subjects (indexed by $i$) in a
population, of which $n_1$ ($i = 1, \ldots, n_1$) subjects are diseased ($D_i = 1$),
and $n_2$ ($i = n_1 + 1, \ldots, n$) subjects are non-diseased ($D_i = 0$). Assume
a prediction model which yields a predicted probability, $p_i$, for
each and every subject in the population. The prediction model is
well calibrated and unbiased such that the mean predicted
probability, $\bar{p}$, is equal to disease prevalence in the population, that
is, $\bar{p} = n_1/n$. Figure 1 presents the computing formulas and
interpretations of various performance measures, including AUC,
Gini, Pietra and sBrier.

The formula for AUC is

$$\mathrm{AUC} = \frac{\sum_{i=1}^{n_1} \sum_{j=n_1+1}^{n} s(p_i, p_j)}{n_1 \times n_2},$$

where $s(p_i, p_j)$ is a scoring function comparing the predicted
probabilities for a pair of subjects: $s(p_i, p_j) = 1$ if $p_i > p_j$, 0.5 if
$p_i = p_j$, and 0 otherwise. The formula clearly shows that AUC is
the probability that the predicted probability of a randomly
selected diseased subject exceeds that of a randomly selected
non-diseased subject.
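As a minimal sketch of the pairwise definition above, the AUC can be computed by scoring every diseased/non-diseased pair. The function name `auc` and the toy data are illustrative assumptions; the double loop is written for clarity rather than speed.

```python
def auc(p, diseased):
    """AUC as the average pairwise score over all (diseased, non-diseased) pairs."""
    cases = [pi for pi, d in zip(p, diseased) if d == 1]
    controls = [pi for pi, d in zip(p, diseased) if d == 0]
    total = 0.0
    for pi in cases:
        for pj in controls:
            if pi > pj:
                total += 1.0      # diseased subject ranked higher
            elif pi == pj:
                total += 0.5      # tie
            # otherwise the pair scores 0
    return total / (len(cases) * len(controls))


# Hypothetical toy data: two diseased and two non-diseased subjects.
p_example = [0.9, 0.6, 0.6, 0.2]
d_example = [1, 1, 0, 0]
print(auc(p_example, d_example))
```

In the toy data, three of the four pairs are correctly ordered and one is tied, so the pairwise scores sum to 3.5 out of 4.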

It is of interest to compare the computing formulas for Gini,
Pietra and sBrier (Figure 1). Each can be read as a ratio:

Gini = (mean separation for the current model) / (mean separation for an error-free model),

Pietra = (mean gain for the current model) / (mean gain for an error-free model),

and

sBrier = (mean squared gain for the current model) / (mean squared gain for an error-free model),

respectively. Note that initially, all subjects in the population are
on the same footing: they share the same a priori probability ($\bar{p}$). When a
prediction model is used, however, they diverge (the $p_i$s are different
in general). Gini quantifies the "separation" (subject-to-subject
variation in the a posteriori probability) of a model, while Pietra and
sBrier quantify the "gain" (deviation of the a posteriori probability
from the a priori probability).
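To make the notions of "separation" and "gain" concrete, the three indices can be sketched as follows. This is a hedged illustration: it assumes that separation is measured by the mean absolute difference between all pairs of predicted probabilities, that gain is the mean absolute deviation from the mean predicted probability, and that squared gain is the mean squared deviation, each divided by its value for an error-free model (2p̄(1 − p̄), 2p̄(1 − p̄) and p̄(1 − p̄), respectively). The function names are illustrative.

```python
def _mean(p):
    return sum(p) / len(p)


def gini(p):
    """Mean pairwise separation relative to an error-free model."""
    n, pbar = len(p), _mean(p)
    separation = sum(abs(pi - pj) for pi in p for pj in p)
    return separation / (2 * n ** 2 * pbar * (1 - pbar))


def pietra(p):
    """Mean absolute gain relative to an error-free model."""
    n, pbar = len(p), _mean(p)
    gain = sum(abs(pi - pbar) for pi in p)
    return gain / (2 * n * pbar * (1 - pbar))


def sbrier(p):
    """Mean squared gain relative to an error-free model."""
    n, pbar = len(p), _mean(p)
    squared_gain = sum((pi - pbar) ** 2 for pi in p)
    return squared_gain / (n * pbar * (1 - pbar))


# Hypothetical predicted probabilities with mean 0.5.
p_demo = [0.8, 0.6, 0.4, 0.2]
print(gini(p_demo), pietra(p_demo), sbrier(p_demo))
```

Under these assumptions an error-free model (all predicted probabilities equal to 0 or 1) scores 1 on all three indices, and a model that assigns every subject the a priori probability scores 0. Note that the indices use only the predicted probabilities, which is appropriate only under the calibration assumption stated earlier.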

Simulation Schemes 

Three variables are assumed to be predictive of a particular
disease (D): the baseline score (S) and two new markers
(M1 and M2). It is assumed that S is a composite of traditional
risk factors (age, smoking, systolic blood pressure, total and high
density lipoprotein cholesterol levels, etc.) standardized to a
normal distribution with a mean of 0 and a standard deviation
of 1. The new markers are assumed to be binary. In order to
acknowledge a correlation between S and the two new markers, let
the prevalence of M1 and M2 be 85% when S is above average
(S > 0), and 75% otherwise.

It is assumed that the discrimination power of M1 is
independent of the baseline score, whereas the discrimination
power of M2 is not uniform, but is concentrated in the gray zone
of the baseline model (where the predicted probability using the
baseline model is close to the a priori probability). Specifically, the
disease risk is assumed to follow a logistic model, as below:

$$\operatorname{logit} \Pr(D = 1 \mid S, M_1, M_2) = -3 + 2 \times S + 1.5 \times M_1 + 2.2 \times K(S) \times M_2,$$

where K(x) is a Gaussian kernel function centered at 0:
K(x) = exp(− … ). In this model, the disease odds ratio
per unit increase in the baseline score (disease odds ratio for one
standard deviation increase in the composite variable of traditional
risk factors) is exp(2) = 7.4. To simulate new markers that are
strong predictors for the disease, we let the disease odds ratio for
M1 be exp(1.5) = 4.5 irrespective of the baseline score
(Figure 2), and the disease odds ratio for M2 reach a peak
[exp(2.2) = 9.0] when the baseline score is at its average value
(S = 0) and rapidly decay when the baseline score is above or
below average (Figure 2).

Figure 1. Computing formulas and interpretations of various performance measures.

AUC
  Formula: $\sum_{i=1}^{n_1} \sum_{j=n_1+1}^{n} s(p_i, p_j) \,/\, (n_1 \times n_2)$
  Interpretation: A probability that the predicted probability of a randomly selected diseased subject exceeds that of a randomly selected non-diseased subject.

Gini
  Formula: $\sum_{i=1}^{n} \sum_{j=1}^{n} |p_i - p_j| \,/\, [2 \times n^2 \times \bar{p} \times (1 - \bar{p})]$
  Interpretation: A measure of "separation" (subject-to-subject variation in the a posteriori probability) of a model.

Pietra
  Formula: $\sum_{i=1}^{n} |p_i - \bar{p}| \,/\, [2 \times n \times \bar{p} \times (1 - \bar{p})]$
  Interpretation: A measure of "gain" (deviation of the a posteriori probability from the a priori probability).

sBrier
  Formula: $\sum_{i=1}^{n} (p_i - \bar{p})^2 \,/\, [n \times \bar{p} \times (1 - \bar{p})]$
  Interpretation: A measure of "squared gain" (deviation of the a posteriori probability from the a priori probability).

doi:10.1371/journal.pone.0091249.g001
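The simulation scheme described above can be sketched in a few lines. Several details are assumptions made only for illustration: the kernel is written as exp(−x²/(2h²)) with a hypothetical bandwidth h = 0.5 (the bandwidth is not recoverable from the text here), M1 and M2 are drawn independently of each other given S, and the function names are invented. The sketch generates subjects and their true disease risks; it does not fit the three prediction models.

```python
import math
import random


def kernel(x, bandwidth):
    """Gaussian kernel centered at 0 with K(0) = 1 (bandwidth is an assumed value)."""
    return math.exp(-x ** 2 / (2 * bandwidth ** 2))


def m2_odds_ratio(s, bandwidth=0.5):
    """Disease odds ratio of M2 at baseline score s under the assumed kernel."""
    return math.exp(2.2 * kernel(s, bandwidth))


def simulate_subject(rng, bandwidth=0.5):
    """Draw one subject: baseline score, two binary markers, true risk, disease status."""
    s = rng.gauss(0.0, 1.0)                      # standardized baseline score
    prevalence = 0.85 if s > 0 else 0.75         # marker prevalence depends on S
    m1 = int(rng.random() < prevalence)
    m2 = int(rng.random() < prevalence)          # assumed independent of M1 given S
    logit = -3 + 2 * s + 1.5 * m1 + 2.2 * kernel(s, bandwidth) * m2
    risk = 1.0 / (1.0 + math.exp(-logit))
    d = int(rng.random() < risk)
    return s, m1, m2, risk, d


rng = random.Random(2014)                        # fixed seed for reproducibility
training_sample = [simulate_subject(rng) for _ in range(500)]
print(round(math.exp(2), 1), round(math.exp(1.5), 1), round(m2_odds_ratio(0.0), 1))
```

With any bandwidth, the odds ratio for M2 equals exp(2.2) at S = 0 and approaches 1 far from the average, which is the gray-zone concentration the scheme is designed to produce; how fast it decays depends on the bandwidth chosen.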

A total of 500 subjects were simulated as the training sample,
and another 500 subjects were simulated as the validation sample.
The performances of three prediction models were compared: (I)
the model with the baseline score only, (II) the model with the
baseline score plus M1 and (III) the model with the baseline score
plus M2. A total of 10,000 simulations were performed.

Results 

In Figure 3, it can be seen that there is almost no change in the
distributions of the predicted probabilities between the baseline
model (A) and the model with M1 added (B). Using the AUC
index, it can be seen that adding M1 increases the prediction
performance of the model from 0.822 to 0.841, an absolute
(relative) improvement of a mere +0.019 (+2.3%) (Table 1). Note
that the absolute improvement gauged by the Gini index (+0.039)
is twice that by AUC (apart from rounding error; in fact,
Gini = 2 × AUC − 1, see [15]), and the relative improvement is
+6.1%. The Pietra [+0.036 (+7.4%)] and the sBrier [+0.038
(+12.4%)] also show larger improvements than the AUC does.

Figure 2. Disease odds ratios (discrimination powers) of the new markers (M1 and M2), plotted against the baseline score (solid line: when the discrimination power of
the new marker (M1) is independent of the baseline score; dotted line: when the discrimination power of the new marker (M2) is
concentrated in the gray zone of the baseline model).
doi:10.1371/journal.pone.0091249.g002

Figure 3. Distribution of the predicted probabilities for a baseline model (A), and the model with the new marker M1 added (B), or
M2 added (C). The discrimination power of M1 is independent of the baseline score, and that of M2 is concentrated in the gray zone of the baseline
model. The solid vertical bar indicates the grand mean of the predicted probabilities, and the two dotted vertical bars, the means of the predicted
probabilities for the diseased subjects and the non-diseased subjects, respectively.
doi:10.1371/journal.pone.0091249.g003

By contrast, the results are much more intriguing when M2 is
added. In Figure 3, the number of people (diseased or
non-diseased) in the gray zone (near the solid vertical bars) is drastically
reduced when M2 is added (C) to the baseline model (A); most
diseased individuals move to the right (higher predicted
probability), whereas most non-diseased individuals move to the left. An
informative marker like M2 certainly deserves a high rating;
however, the AUC credits it with an absolute (relative)
improvement in prediction performance of only +0.022 (+2.7%), and the
Gini, twice that value, but still only +0.043 (+6.7%) (Table 1).
Comparatively, the Pietra [+0.083 (+17.1%)] and the sBrier
[+0.057 (+18.6%)] indices more fittingly judge the value of the
marker.

It is also of interest to compare models "B + M1" and "B + M2"
head to head. Figure 3 shows that the two models generate
predicted probabilities that are quite different in distribution (B vs.
C); however, AUC and Gini fail to set them apart (AUC: 0.841 vs.
0.844; Gini: 0.683 vs. 0.687). By contrast, Pietra and sBrier clearly
differentiate between the two models (Pietra: 0.521 vs. 0.568;
sBrier: 0.344 vs. 0.363).
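The absolute and relative improvements quoted in this section follow directly from the index values reported in Table 1. The short sketch below reproduces that arithmetic; the dictionary layout and the function name `improvement` are illustrative choices, and the numbers are the rounded values reported in the table, so results are only accurate to that rounding.

```python
# Index values as reported in Table 1 (rounded to three decimals).
table1 = {
    "AUC":    {"B": 0.822, "B+M1": 0.841, "B+M2": 0.844},
    "Gini":   {"B": 0.644, "B+M1": 0.683, "B+M2": 0.687},
    "Pietra": {"B": 0.485, "B+M1": 0.521, "B+M2": 0.568},
    "sBrier": {"B": 0.306, "B+M1": 0.344, "B+M2": 0.363},
}


def improvement(baseline, new):
    """Return (absolute improvement, relative improvement in percent)."""
    absolute = new - baseline
    return round(absolute, 3), round(100 * absolute / baseline, 1)


for index, values in table1.items():
    for model in ("B+M1", "B+M2"):
        absolute, relative = improvement(values["B"], values[model])
        print(f"{index:6s} B -> {model}: {absolute:+.3f} ({relative:+.1f}%)")
```

The relative improvement divides by the baseline value of the same index, so indices with a lower baseline (such as sBrier at 0.306) show a larger percentage change for a similar absolute change.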
Table 1. Improvements in prediction performances when new markers, M1 and M2, are added to a baseline model (B), respectively.

Performance Measure                AUC               Gini              Pietra            sBrier

Model
  B                                0.822             0.644             0.485             0.306
  B+M1                             0.841             0.683             0.521             0.344
  B+M2                             0.844             0.687             0.568             0.363

Absolute (Relative) Improvement
  from B to B+M1                   +0.019 (+2.3%)    +0.039 (+6.1%)    +0.036 (+7.4%)    +0.038 (+12.4%)
  from B to B+M2                   +0.022 (+2.7%)    +0.043 (+6.7%)    +0.083 (+17.1%)   +0.057 (+18.6%)

The discrimination power of M1 is independent of the baseline score, whereas that of M2 is concentrated in the gray zone of the baseline model.
doi:10.1371/journal.pone.0091249.t001



In addition, this study examined situations in which a strong
continuous-scale marker (Exhibit S1) and multiple weak binary
markers (Exhibit S2; to simulate genetic markers that are by
themselves weak predictors for the disease but are strongly
predictive of the disease if used collectively as a genetic score)
were added to the baseline model, respectively. The conclusions
regarding the comparisons of the various performance indices
remain the same as when one strong binary marker is added, as
shown above.

Discussion 

ROC curve analysis is the most widely used method for the
evaluation of diagnostic test or prediction model performance
[16-19]. For any subject to be diagnosed/predicted, a diagnostic test
yields a single test value which, depending on the test used, can be
on a binary, ordinal or continuous scale, whereas a prediction model,
upon integrating the information of more than one predictor,
produces a probability, which is a value between 0 and 1. Lorenz
curve analysis has also enjoyed a long history of use, dating back to
1905 [20]. However, it has been primarily used by economists
(demographers) to study inequality in income (population)
distribution [21,22]. Lee [15] pioneered the use of Lorenz curve
analysis in biomedicine (in the context of diagnostic test
evaluation, although he did not consider prediction models). The
interpretation of the ROC curve-based AUC index is actually
rather unrealistic: subjects do not come in pairs, one
diseased and the other non-diseased, with their predicted
probabilities to be compared. By contrast, the Lorenz curve-based
Gini and Pietra indices follow study subjects from their a priori
probabilities to their a posteriori probabilities (after using a
prediction model), and should have more relevance for actual
clinical practice.

The Brier score has been used to evaluate the accuracy of weather
forecasting since 1950 [23]. In recent decades it has also been
applied in biomedical fields [24-26]. The Brier score depends on
the disease prevalence (the a priori probability) of the population
where the prediction model is built, and therefore it is unsuitable
for making a comparison between populations. Steyerberg et al.
[7] proposed a standardized version of the Brier score, the sBrier,
which is an index between 0 and 1, and is prevalence-independent.
Austin and Steyerberg [27] used sBrier to examine performance
changes when new markers were added to a baseline model.
However, they did not consider the type of markers with
discrimination power concentrated in the gray zone of the
baseline model, and therefore did not recognize that sBrier is
sensitive to gray-zone resolving markers. Another, lesser known
fact about sBrier is that the change in sBrier upon addition of new
markers is equal to the IDI index itself. A proof of this is given in
Exhibit S3.
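The following is a brief sketch of why such an identity is plausible; it is not necessarily the argument used in Exhibit S3. It assumes perfect calibration, so that $\Pr(D = 1 \mid p) = p$ and $\mathrm{E}[p] = \bar{p}$, and it takes sBrier to be the mean squared gain divided by $\bar{p}(1 - \bar{p})$. The symbols $\bar{p}_{D}$ and $\bar{p}_{ND}$ (mean predicted probabilities among diseased and non-diseased subjects) are notation introduced here for illustration.

```latex
\begin{align*}
\bar{p}_{D}  &= \mathrm{E}[p \mid D = 1] = \frac{\mathrm{E}[p \cdot p]}{\bar{p}}
              = \frac{\mathrm{E}[p^{2}]}{\bar{p}}, \\
\bar{p}_{ND} &= \mathrm{E}[p \mid D = 0] = \frac{\mathrm{E}[p\,(1 - p)]}{1 - \bar{p}}
              = \frac{\bar{p} - \mathrm{E}[p^{2}]}{1 - \bar{p}}, \\
\bar{p}_{D} - \bar{p}_{ND}
             &= \frac{\mathrm{E}[p^{2}]\,(1 - \bar{p}) - \bar{p}\,\bigl(\bar{p} - \mathrm{E}[p^{2}]\bigr)}
                     {\bar{p}\,(1 - \bar{p})}
              = \frac{\mathrm{E}[p^{2}] - \bar{p}^{2}}{\bar{p}\,(1 - \bar{p})}
              = \frac{\mathrm{E}\bigl[(p - \bar{p})^{2}\bigr]}{\bar{p}\,(1 - \bar{p})}
              = \mathrm{sBrier}, \\
\mathrm{IDI} &= \bigl(\bar{p}_{D}^{\,\mathrm{new}} - \bar{p}_{ND}^{\,\mathrm{new}}\bigr)
              - \bigl(\bar{p}_{D}^{\,\mathrm{old}} - \bar{p}_{ND}^{\,\mathrm{old}}\bigr)
              = \mathrm{sBrier}^{\mathrm{new}} - \mathrm{sBrier}^{\mathrm{old}}.
\end{align*}
```

Under these assumptions, the sBrier of a single model equals the difference in mean predicted probability between diseased and non-diseased subjects, so the difference of two sBrier values reproduces the IDI. In finite samples with imperfect calibration the equality would hold only approximately.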

It is worth noting that Gini, Pietra and sBrier indices can be 
expressed as ratios, comparing the resolution power (separation for 
Gini; gain for Pietra; squared gain for sBrier) of the current model 
with that of an error-free model. They are all therefore indices 
between 0 and 1, and can be neatly interpreted as a per cent 
maximum resolution power of the current model. In Table 1, the 
prediction performances of the baseline model are 0.644 (Gini), 
0.485 (Pietra), and 0.306 (sBrier), respectively. This means that the 
baseline model still has a great deal of room for improvement; 
currently, it only achieves 64.4% separation/48.5% gain/30.6% 
squared gain of a sure-fire prediction model. 

This study takes the view that patients (and their physicians) should
be more interested in the gain (or squared gain) of a model, which
tells them how much their disease probability could be expected to be
revised if they use that model, than in the separation, which
compares two randomly chosen people. This study found that the
two indices that quantify gains (Pietra and sBrier) are also those
that are most sensitive to gray-zone resolving markers.

Taken together, Pietra and sBrier are promising alternative
prediction model performance measures, in light of their ease of
interpretation, clinical relevance and sensitivity to gray-zone
resolving markers. Further work is needed to fully develop the
statistical inference procedures (hypothesis tests, confidence
intervals, etc.) regarding these two indices.

Supporting information 

Exhibit S1 Simulation when a strong continuous-scale
marker is added to the prediction model. 

(PDF) 

Exhibit S2 Simulation when multiple weak binary 
markers are added to the prediction model. 

(PDF) 

Exhibit S3 A proof that the change in sBrier upon 
addition of new marker(s) is equal to the IDI index. 

(PDF) 